Text augmentation preserving persona speech style and vocabulary
Abstract
Many natural language processing tasks require large datasets. However, for many of them, collecting such data is tedious and expensive and requires the involvement of experts. The amount of data can be increased with data augmentation methods, but classical approaches can introduce phrases into the corpus that differ from the target person’s speech style and vocabulary. This can change the target class and produce utterances with unnatural word usage and no clear meaning. This article proposes a new text data augmentation method that preserves a person’s individual speech style and vocabulary. The core of the method is to build individual templates for each person from an analysis of the syntactic trees of their sentences and then to generate new utterances from these templates. The method was tested on the task of assessing a user’s emotional state in a dialogue. The study was carried out on datasets in English and Russian. The proposed method improved the quality of the solution for both languages: an increase of up to 2 % in accuracy and weighted F1 was observed for various models. The results can be applied to improve the accuracy and weighted F1 of models that solve various tasks in English and Russian.
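The template idea described in the abstract can be illustrated with a minimal sketch. It is not the authors’ implementation. It makes strong simplifying assumptions: part-of-speech sequences stand in for full syntactic trees, the tags are written by hand instead of coming from a parser, and templates are grouped by emotion label as well as by person (our assumption, meant to reduce the risk of changing the target class). The names `CORPUS`, `build_models`, and `augment` are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical toy data: (persona, emotion label, hand-tagged tokens).
# A real pipeline would obtain the tags and trees from a syntactic parser.
CORPUS = [
    ("alice", "joy", [("I", "PRON"), ("love", "VERB"), ("sunny", "ADJ"), ("mornings", "NOUN")]),
    ("alice", "joy", [("We", "PRON"), ("enjoy", "VERB"), ("quiet", "ADJ"), ("evenings", "NOUN")]),
    ("bob", "anger", [("Trains", "NOUN"), ("annoy", "VERB"), ("me", "PRON")]),
]


def build_models(corpus):
    """Map (persona, label) -> (templates, vocabulary grouped by POS tag)."""
    templates = defaultdict(set)
    vocab = defaultdict(lambda: defaultdict(set))
    for persona, label, tokens in corpus:
        key = (persona, label)
        # A template here is just the POS sequence of one utterance.
        templates[key].add(tuple(pos for _, pos in tokens))
        # The vocabulary keeps only words this persona actually used.
        for word, pos in tokens:
            vocab[key][pos].add(word)
    return {
        key: (sorted(templates[key]),
              {pos: sorted(words) for pos, words in vocab[key].items()})
        for key in templates
    }


def augment(corpus, persona, label, n, seed=0):
    """Generate n new utterances from the persona's own templates and words."""
    rng = random.Random(seed)
    templates, vocab = build_models(corpus)[(persona, label)]
    generated = []
    for _ in range(n):
        template = rng.choice(templates)
        generated.append(" ".join(rng.choice(vocab[pos]) for pos in template))
    return generated
```

For example, `augment(CORPUS, "alice", "joy", 3)` returns three four-word utterances built only from words that the hypothetical persona `alice` used, such as recombinations of her pronouns, verbs, adjectives, and nouns. Real syntactic trees would constrain the recombination far more tightly than flat tag sequences do.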